RNA Quest: Rethinking RNA-Seq reports through reanalysis of previously published neuronal transcriptomes

Ryan Pevey

04/02/25

by Ryan Pevey

If you’re reading this right now, this is a rough draft and I’m working on outputting the final post. Feel free to check back in, drop me a message, or comment at the bottom of the page and I can keep you posted on when I’m done.

Also welcome to evolio, I hope you enjoy the experience.

Intro

Introduce the problem of reporting and the gap in communication between bioinformaticists and biologists. I sit at this nexus, with training and experiencce in boths worlds and I can act as a liason between the two. If you are not a biologist, I’ve tried to make this report focused on the viewpoint of the biologist, to help facilitate a deeper exploration of the dataset for them.

Summary

This reports the differential expression results of my re-analysis for this really nice RNA-Seq dataset from the neurons of patients with epilepsy (DOI: Pai et al., 2022). The original publication is mostly concerned with transcriptional alterations in glial cells but I’m electing to reanalyze their neuronal samples for a few reasons. First, to highlight the usefulness of previously published data. Second, so that I am not just retreading the same story over again. Third, and most selfishly, my experience lies mainly with neuronal datasets so I’m most familiar with this output. No offense to glia, they are great!

The dataset consists of bulk RNA-Seq reads from three temporal lobe control post-mortem samples (TL), and three temporal lobe with epilepsy patient samples (TLE). In the original paper, four populations of cells were isolated from the samples via Flourescence Activated Cell Sorting (FACS).

Results summary text here

Differential expression analysis

I started this project by downloading the fastq files for each sample from NCBI’s, Gene Expression Omnibus (GEO accession: GSE140393), then aligned reads to the reference genome and produced a set of counts files which contain the number of reads counted for each genetic feature in the reference transcriptome. Those counts files are the input files for this differential expression analysis. Which continues with loading all of the proper software and data files for downstream analysis. If you’re not a programmer and you’ve ever been curious about the steps that go into this type of analysis click on the buttons labelled ‘Click to expand code’ and it will expand the code that goes into each step. It’s also worth it to note that this script is only the very last stage of the analysis.

If you want all of the scripts for the full analysis including the preprocessing of the data before the differential expression, you can find it on my github repository for this project here: RNA Quest at GitHub.

library(ggplot2)
library(limma)
library(Glimma)
library(viridis)
# library(AnnotationDbi)
# library(org.Hs.eg.db)
library(DESeq2)
library(treemap)
library(gridExtra)
library(grid)
library(vsn)
library(plotly)
library(pheatmap)
library(EnhancedVolcano)
# library(htmlwidgets)
library(kableExtra)
library(DT)
library(reactable)
library(getable)
library(tibble)

###Read in datasets
#Import read count matrix
dat <- read.delim('../../data/merged.tsv', header = TRUE, sep="\t", check.names = FALSE)

#Remove special tagged rows, the first four, from the head of dat.
dat <- dat[5:length(dat[,1]),]
#Extract Ensembl ID and gene symbols
geneID <- as.data.frame(cbind(rownames(dat), dat$gene_symbols))
colnames(geneID) <- c('EnsID', 'gene_symbols')
dat <- dat[,2:7]

#Import sample metadata.
metadata <- read.delim('../../data/SraRunMetadata.csv', header=TRUE, sep=",")

#Metadata condition as factor
metadata$condition <- factor(metadata$condition)
levels(metadata$condition) <- c('TL','TLE')
metadata$ID <- c('Ctrl1', 'Ctrl2', 'Ctrl3', 'TLE1', 'TLE2', 'TLE3')

#Create sample information table
coldata <- metadata[,c('ID','condition', 'Run')]
rownames(coldata) <- metadata$ID
#Assign row names of coldata to column names of dat
colnames(dat) <- rownames(coldata)

#Import results tables
resTable <- read.csv(paste0('../../results/','TL_TLE_allGenes_DEresults.csv'), header=TRUE)

###Create differential expression object
#This block of code is not executed, but is the code of how the DESeq object (dds) was created.
#Create sample information table
# coldata <- metadata[,c('condition','ID','Run')]
# rownames(coldata) <- metadata$SampleID
# colnames(dat) #<- metadata$SampleID

#Create DESeqDataSet object
# dds <- DESeqDataSetFromMatrix(countData = dat,
#                               colData = coldata,
#                               design = ~ condition)
# dds

#Pre-filter low count genes
#Smallest group size is the number of samples in each group (3 in this dataset).
# smallestGroupSize <- 3
# keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
# dds <- dds[keep,]

#Differential expression analysis
# dds <- DESeq(dds)
#Remove nonconverged genes
# dds <- dds[which(mcols(dds)$betaConv),]

#Load RDS files for following code chunk
dds <- readRDS('../../bin/dds_DE_TL-TLE.rds')

#Build results table
#Set contrast groups, reference level should be listed last.
contrast <- c('condition', 'TLE', 'TL')
res <- results(dds,
               contrast = contrast,
               alpha = 0.05,
               pAdjustMethod = 'BH')

#Create significant results table ordered by fold change
res05 <- res[which(res$padj < 0.05),]
res05 <- res05[order(-res05$log2FoldChange),]

#Create condition labels for report titles
status <- c('Control', 'Epilepsy')

Differential Expression Analysis Results

Temporal lobe neurons of Epilepsy patients (TLE) vs. Temporal lobe neurons of Control post-mortem tissue (ref.)

Alright, the meat of the analysis. The star of the show. Keep in mind that there is a fair amount of exploratory analysis and quality control that goes into the analysis at this stage to ensure that the data is processed properly (e.g. normalization, regularization). Those steps were performed here but are presented later down the page so that we can jump right to the exciting part of the results. RESULTS SUMMARY TEXT

There are 1576 significant differentially expressed genes (FDR < 0.05) when comparing Epilepsy to Control as reference. 758 are up-regulated, 818 are down-regulated. The results of the DE analysis are summarized in the table below. Overall, roughly the same number of genes(~3%) are down-regulated compared to up-regulated genes.

Summary of results

summary(res)
## 
## out of 26557 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 758, 2.9%
## LFC < 0 (down)     : 818, 3.1%
## outliers [1]       : 36, 0.14%
## low counts [2]     : 7209, 27%
## (mean count < 22)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Gene expression dashboard

I’ve made this interactive gene expression dashboard below to help you explore the dataset. Mouse over the genes on the volcano plot to explore the highlighted significant genes. If you want to see the expression profile for that specific gene then enter it, or whatever your favorite gene is, into the gene expression plot on the right. If the gene is statistically significant, than an FDR value and bracket indicating significance are automatically populated. If not then the words “No DE result” will appear instead. If you want to restrict the gene expression plot to only display significant genes than toggle the checkbox beneath the plot. You can see the data used to create the plot with the table beneath it or download it by clicking the download graph data button to capture it as a csv. Note that this is log transformed expression data for visualization purposes, not read counts. So interpret the values accordingly.

I have the gene expression plot hosted by an onrender server then piped into this page through an iframe. So if the plot isn’t loading correctly it has probably just gone inactive and you can find it here (https://geneexprdash.onrender.com/). Actually, if you visit there then reload this page it should reactivate and load correctly.

Volcano plot

Explore Gene Expression Interactively

Heatmap

Heatmaps provide a visual representation of gene expression across multiple samples, using color gradients to indicate expression levels. Clustering patterns can reveal relationships between genes and conditions, helping to identify co-regulated genes and distinct expression profiles.

Up-regulated and Down-regulated genes: Epilepsy vs. Control (ref.)

Both the top up-regulated and down-regulated genes in the neurons of epilepsy patients.

## using 'ashr' for LFC shrinkage. If used in published research, please cite:
##     Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
##     https://doi.org/10.1093/biostatistics/kxw041


Results table: Epilepsy vs. Control (ref.)

# Add significant column. If the FDR is NA or higher than 0.05, than 'No'.
resTable$Significant <- ifelse(is.na(resTable$FDR) | resTable$FDR >= 0.05, "No", "Yes")
datTable <- cbind(resTable[, 2:7], resTable[, 16], resTable[, 9], resTable[, 11:14])
datTable[, 2:4] <- round(datTable[, 2:4], digits = 3)
datTable$pvalue <- formatC(datTable$pvalue, format = "e", digits = 2)
datTable$FDR <- formatC(datTable$FDR, format = "e", digits = 2)
datTable <- datTable[order(datTable$log2FoldChange, decreasing = TRUE), ]
# Reset the index to that row numbers match the order of the table.
rownames(datTable) <- NULL
colnames(datTable) <- c("Gene", "Base Mean", "log2FC", "LFC SE", "p-value", "FDR", "Significant?", "Ensembl ID",
    "Symbol", "MAP locus", "Gene type", "Gene description")
write.csv(datTable, file = "../../results/resTable1.csv", row.names = FALSE)
# Import datTable datTable <- read.csv(paste0('../../results/','resTable1.csv'), header=TRUE)
DT::datatable(as.data.frame(datTable), options = list(scrollX = TRUE, scrollCollapse = TRUE), caption = paste0("Table 1: These are the results of the differential expression analysis when comparing ",
    status[2], " and ", status[1], " (ref.). The columns are ordered by log2FoldChange with up-regulated genes at the top and down-regulated genes at the bottom. Numeric values listed to 3 significant digits. An FDR is not calculated for non-significant results. You can reorder the columns or search for a gene of interest by Ensembl ID or Gene name in the search bar in the top right of the table."),
    class = "display")

reacTable table

# Create reactable table
reactable(datTable, searchable = TRUE, pagination = TRUE, highlight = TRUE, striped = TRUE, bordered = TRUE,
    columns = list(log2FC = colDef(align = "right"), `Base Mean` = colDef(align = "right"), `LFC SE` = colDef(align = "right"),
        `p-value` = colDef(align = "right"), FDR = colDef(align = "right")), defaultPageSize = 25, showPageSizeOptions = TRUE)
cat(paste0("<p><strong>Table 1:</strong> These are the results of the differential expression analysis when comparing <em>",
    status[2], "</em> and <em>", status[1], " (ref.)</em>. The columns are ordered by log2FoldChange with up-regulated genes at the top and down-regulated genes at the bottom. Numeric values listed to 3 significant digits. An FDR is not calculated for non-significant results. You can search for a gene of interest by Ensembl ID or gene name using the search box above.</p>"))

Table 1: These are the results of the differential expression analysis when comparing Epilepsy and Control (ref.). The columns are ordered by log2FoldChange with up-regulated genes at the top and down-regulated genes at the bottom. Numeric values listed to 3 significant digits. An FDR is not calculated for non-significant results. You can search for a gene of interest by Ensembl ID or gene name using the search box above.

GETable table

Table 1: Differential Expression Results

Test table

GETable Test

Gene expression plots

Gene expression plots visualize how a gene’s expression varies across conditions or samples, helping identify trends, variability, and potential differential expression. Higher expression levels indicate greater transcript abundance, while statistical comparisons (e.g., boxplots, dot plots) help assess significance between groups.

Let’s take a moment to discuss interpretting these figures. It’s important to interpret these boxplots carefully, with the small sample size. The boxplot includes an in-built outlier test where any point that is more than 1.5x the IQR above or below the median value will be plotted outside of the range of “whiskers”. And generally, it would be preferrable to have a larger sample size than 3 for each group.

Interesting genes: Epilepsy vs. Control (ref.)

# Plot counts: highest foldChange gene
ens <- rownames(res)[which.max(res$log2FoldChange)]
plt <- plotCounts(dds, gene = ens, intgroup = "condition", normalized = TRUE, transform = TRUE, returnData = TRUE)
y_max <- max(plt$count, na.rm = TRUE)
padj_value <- format(res[ens, "FDR"], digits = 3)
levels(plt$condition) <- c("Control", "Epilepsy")

p <- ggplot(plt, aes(x = condition, y = count, color = condition)) + geom_boxplot(color = "black", outlier.shape = NA,
    width = 0.25) + geom_point(cex = 3, position = position_jitter(w = 0.1, h = 0)) + scale_color_viridis_d(begin = 0.75,
    end = 0.25, direction = 1) + labs(title = NULL, y = "Counts", x = "Condition") + theme_bw() + theme(text = element_text(size = 10),
    plot.title = element_text(hjust = 0.5), legend.position = "none", panel.border = element_blank(),
    panel.grid = element_blank(), axis.line = element_line("black"), axis.text.x = element_text(angle = 45,
        hjust = 1)) + scale_y_log10() + annotate("text", x = 1.5, y = y_max * 3, label = ens) + annotate("text",
    x = 1.5, y = y_max * 1.5, label = paste("FDR =", padj_value)) + annotate("segment", x = 1, xend = 2,
    y = y_max * 1.3, yend = y_max * 1.3) + annotate("segment", x = 1, xend = 1, y = y_max * 1.3, yend = y_max *
    1.2) + annotate("segment", x = 2, xend = 2, y = y_max * 1.3, yend = y_max * 1.2)

# Plot counts: lowest foldChange gene
ens <- rownames(res)[which.min(res$log2FoldChange)]
plt <- plotCounts(dds, gene = ens, intgroup = "condition", normalized = TRUE, transform = TRUE, returnData = TRUE)
y_max <- max(plt$count, na.rm = TRUE)
padj_value <- format(res[ens, "FDR"], digits = 3)
levels(plt$condition) <- c("Control", "Epilepsy")

q <- ggplot(plt, aes(x = condition, y = count, color = condition)) + geom_boxplot(color = "black", outlier.shape = NA,
    width = 0.25) + geom_point(cex = 3, position = position_jitter(w = 0.1, h = 0)) + scale_color_viridis_d(begin = 0.75,
    end = 0.25, direction = 1) + labs(title = NULL, y = "Counts", x = "Condition") + theme_bw() + theme(text = element_text(size = 10),
    plot.title = element_text(hjust = 0.5), legend.position = "none", panel.border = element_blank(),
    panel.grid = element_blank(), axis.line = element_line("black"), axis.text.x = element_text(angle = 45,
        hjust = 1)) + scale_y_log10() + annotate("text", x = 1.5, y = y_max * 3, label = ens) + annotate("text",
    x = 1.5, y = y_max * 1.5, label = paste("FDR =", padj_value)) + annotate("segment", x = 1, xend = 2,
    y = y_max * 1.3, yend = y_max * 1.3) + annotate("segment", x = 1, xend = 1, y = y_max * 1.3, yend = y_max *
    1.2) + annotate("segment", x = 2, xend = 2, y = y_max * 1.3, yend = y_max * 1.2)

# Plot counts: gene of interest
geneOfInt <- "LGR6"
ens <- geneOfInt
plt <- plotCounts(dds, gene = geneOfInt, intgroup = "condition", normalized = TRUE, transform = TRUE,
    returnData = TRUE)
y_max <- max(plt$count, na.rm = TRUE)
padj_value <- format(res[ens, "FDR"], digits = 3)
levels(plt$condition) <- c("Control", "Epilepsy")

r <- ggplot(plt, aes(x = condition, y = count, color = condition)) + geom_boxplot(color = "black", outlier.shape = NA,
    width = 0.25) + geom_point(cex = 3, position = position_jitter(w = 0.1, h = 0)) + scale_color_viridis_d(begin = 0.75,
    end = 0.25, direction = 1) + labs(title = NULL, y = "Counts", x = "Condition") + theme_bw() + theme(text = element_text(size = 10),
    plot.title = element_text(hjust = 0.5), legend.position = "none", panel.border = element_blank(),
    panel.grid = element_blank(), axis.line = element_line("black"), axis.text.x = element_text(angle = 45,
        hjust = 1)) + scale_y_log10() + annotate("text", x = 1.5, y = y_max * 3, label = ens) + annotate("text",
    x = 1.5, y = y_max * 1.5, label = paste("FDR =", padj_value)) + annotate("segment", x = 1, xend = 2,
    y = y_max * 1.3, yend = y_max * 1.3) + annotate("segment", x = 1, xend = 1, y = y_max * 1.3, yend = y_max *
    1.2) + annotate("segment", x = 2, xend = 2, y = y_max * 1.3, yend = y_max * 1.2)

subplot(ggplotly(p), ggplotly(q), ggplotly(r), nrows = 1, margin = 0.05)


Discussion

Congrats, you can now hand your report off to collaborators.


Methods

Pai et al. (DOI: 10.1186/s40478-022-01453-1) methods summary

My methods.

Scripts

This is the R script that I used for this analysis, it has been ported into this report file. The following code is not executed, but is presented for transparency and reproducibility. Again, this is only the script that I used for the differential expression analysis. If you want all of the scripts for the full analysis including the preprocessing of the data before the differential expression, you can find it on my github repository for this project here: RNA Quest at GitHub.

#This script performs differential expression analysis on the Pai et al. dataset (DOI: 10.1186/s40478-022-01453-1) using DESeq2.

library(ggplot2)
library(limma)
library(Glimma)
library(viridis)
library(AnnotationDbi)
library(org.Hs.eg.db)
library(DESeq2)
library(tidyverse)
library(treemap)
library(vsn)
library(plotly)
library(pheatmap)
library(EnhancedVolcano)
library(htmlwidgets)

#####################
###Data Management###
#####################

###Import read count matrix for TLE study
setwd(rstudioapi::selectDirectory(caption = "Select Project Directory"))
dat <- read.delim('data/merged.tsv', header = TRUE, sep="\t", check.names = FALSE)
dat[1:6,1:6]

#Remove special tagged rows, the first four, from the head of dat.
length(dat[,1])
head(dat)[,1:3]
dat <- dat[5:length(dat[,1]),]
head(dat)[,1:3]
length(dat[,1])
#Extract Ensembl ID and gene symbols
geneID <- as.data.frame(cbind(rownames(dat), dat$gene_symbols))
colnames(geneID) <- c('EnsID', 'gene_symbols')
head(geneID)
dat <- dat[,2:7]
head(dat)

#Import sample metadata.
metadata <- read.delim('data/SraRunMetadata.csv', header=TRUE, sep=",")
head(metadata)

#Metadata condition as factor
metadata$condition <- factor(metadata$condition)
levels(metadata$condition) <- c('TL','TLE')
metadata$ID <- c('Ctrl1', 'Ctrl2', 'Ctrl3', 'TLE1', 'TLE2', 'TLE3')

#Create sample information table
coldata <- metadata[,c('ID','condition', 'Run')]
rownames(coldata) <- metadata$ID
coldata
#Assign row names of coldata to column names of dat
colnames(dat) <- rownames(coldata)
head(dat)


#######################
###DESeq DE analysis###
#######################

#Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = dat,
                              colData = coldata,
                              design = ~ condition)
dds

#Replace Ensembl ID with gene symbol were available
IDs <- rownames(dds)
head(IDs)
head(geneID)
colnames(geneID)[1:2] <- c('ensemblID', 'geneID')
#check that IDs and geneID match each other
all(IDs == geneID$ensemblID)
length(which(IDs!=geneID$ensemblID))
length(IDs)
length(geneID$ensemblID)
tail(IDs)
tail(geneID$ensemblID)
#Create a new gene feature column that uses gene symbol, if available, and Ensembl ID otherwise.
geneID$geneNew <- ifelse(is.na(geneID$geneID), 
                         geneID$ensemblID, 
                         geneID$geneID)

#Check that geneNew shows desired output for missing gene symbols
head(geneID)
#Spot check random IDs and ensemblIDs for sanity check
rndm <- round(runif(10, min = 0, max = length(IDs)), digits = 0)
IDs[rndm]
geneID[rndm,]

#Add gene annotations
gene_info <- AnnotationDbi::select(org.Hs.eg.db, 
                    keys = geneID$ensemblID, 
                    columns = c("SYMBOL", "MAP", "GENETYPE", "GENENAME"), 
                    keytype = "ENSEMBL")
head(gene_info)
#return only unique mappings
gene_info <- gene_info[isUnique(gene_info$ENSEMBL),]
#geneID and gene_info do not match yet
all(geneID$ensemblID == gene_info$ENSEMBL)
gene_info <- merge(geneID, gene_info, 
                   by.x = "ensemblID",  # Column in colData to match
                   by.y = "ENSEMBL",    # Column in gene_info to match
                   all.x = TRUE)  # Keep all colData rows
# Restore original order of colData
gene_info <- gene_info[match(geneID$ensemblID, gene_info$ensemblID), ]
#Now they do match!
all(geneID$ensemblID == gene_info$ENSEMBL)
head(geneID)
head(gene_info)

#Add geneID to dds metadata
mcols(dds) <- cbind(mcols(dds), gene_info)
rownames(dds) <- geneID$geneNew

#Pre-filter low count genes
#Smallest group size is the number of samples in each group (3 in this ).
smallestGroupSize <- 3
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds <- dds[keep,]

#Confirm factor levels
dds$ID
dds$condition
dds$Run
# mcols(dds)$MAP

#Differential expression analysis
dds <- DESeq(dds)
#Remove nonconverged genes
dds <- dds[which(mcols(dds)$betaConv),]
boxplot(log2(counts(dds, normalized = TRUE)+1), axes = FALSE, ylab = 'log2 counts')
axis(1, labels = FALSE, at = c(seq(1,40,1)))
axis(2, labels = TRUE)
text(x = seq_along(colnames(dds)), y = -2, labels = colnames(dds), 
     # srt = 45,    #rotate
     adj = 0.5,    #justify
     xpd = TRUE, #print outside of plot area
     cex = 0.75)  #smaller font size

#Plot dispersion estimates
plotDispEsts(dds)

#Save an rds file to start the script here
# saveRDS(dds, 'bin/dds_DE_TL-TLE.rds')
# dds <- readRDS('bin/dds_DE_TL-TLE.rds')


#########################
###Build results table###
#########################

levels(dds$condition)
#Set contrast groups, reference level should be listed last.
contrast <- c('condition', 'TLE', 'TL')
res <- results(dds,
               contrast = contrast,
               alpha = 0.1,
               pAdjustMethod = 'BH')
res
summary(res)

#How many adjusted p-values are less than 0.1?
sum(res$padj < 0.1, na.rm=TRUE)
table(res$padj < 0.1)
#Create significant results table ordered by fold change
res10 <- res[which(res$padj < 0.1),]
res10 <- res10[order(-res10$log2FoldChange),]
#How many adjusted p-values are less than 0.05?
sum(res$padj < 0.05, na.rm=TRUE)
table(res$padj < 0.05)

#How many up-regulated (FDR < 0.1)
length(which(res10$log2FoldChange > 0))
#How many down-regulated (FDR < 0.1)
length(which(res10$log2FoldChange < 0))
bar <- data.frame(direction = c('up','down'), genes = c(length(which(res10$log2FoldChange > 0)),length(which(res10$log2FoldChange < 0))))
bar

#Log fold change shrinkage for visualization and ranking, especially since we only have 3 samples per group
plotMA(res, ylim=c(-3,3))
resultsNames(dds)
resLFC <- lfcShrink(dds, contrast = contrast, type = 'ashr')
resLFC
plotMA(resLFC, ylim=c(-3,3))
res <- resLFC

#Add geneID variables
res$geneNew <- row.names(res)
res_annotated <- merge(as.data.frame(res), gene_info, by="geneNew", all.x=TRUE)
head(mcols(dds)[3])
head(res_annotated)

#Write out results table
#Reorder columns and order results table by smallest p-value
colnames(res_annotated)[1] <- 'gene'
colnames(res_annotated)[6] <- 'FDR'
res_annotated <- res_annotated[order(res_annotated$FDR),]
head(res_annotated)
# write.csv(as.data.frame(res_annotated), file=paste0('results/','TL','_','TLE','_allGenes_DEresults.csv'), row.names = TRUE)

#Glimma MD plot
res.df <- as.data.frame(res_annotated)
res.df$log10MeanNormCount <- log10(res.df$baseMean)
idx <- rowSums(counts(dds)) > 0
res.df <- res.df[idx,]
res.df$FDR[is.na(res.df$FDR)] <- 1
res.df <- res.df[isUnique(res.df$gene),]
exm <- res.df[,c(1:6,13)]
rownames(exm) <- exm$gene
anno <- res.df[,c(1,7:12)]
rownames(anno) <- anno$gene
# color_vector <- viridis(100)[cut(exm$FDR, breaks=100)]
# de_colors <- setNames(viridis(3), c('DE', 'Not Sig', 'Not DE'))
de_colors <- viridis(3)
status <- ifelse(exm$FDR < 0.05 & abs(exm$log2FoldChange) > 1, 'DE',
                 ifelse(exm$FDR > 0.05 & abs(exm$log2FoldChange) < 1, 'Not Sig', 'Not DE'))

glMDPlot(exm,
         xval = 'log10MeanNormCount',
         yval = 'log2FoldChange',
         anno = anno,
         groups = dds$condition,
         # cols = de_colors,
         cols = c('#BEBEBEFF', '#440154FF', '#21908CFF'),
         status = status,
         samples = colnames(dds)
)


#########
###EDA###
#########

#Create treemap of gene biotypes for all genes
biotype_all <- data.frame(table(mcols(dds)$GENETYPE))
biotype_all$Var1 <- gsub('_',' ',biotype_all$Var1)
biotype_all <- biotype_all[order(biotype_all$Freq, decreasing = TRUE),]
# png(filename='results/figs/tree.png',width=800, height=1295)
treemap(biotype_all,
        index = 'Var1',
        vSize = 'Freq',
        type = 'index',
        palette = viridis(length(biotype_all$Var1)),
        aspRatio = 1/1.618,
        title = '',
        inflate.labels = TRUE,
        lowerbound.cex.labels = 0
)
# dev.off()

#Create treemap of gene biotypes for significant results genes
res_annotated$GENETYPE <- as.factor(res_annotated$GENETYPE)
levels(res_annotated$GENETYPE)
biotype_sig <- data.frame(
  table(
    res_annotated$GENETYPE[which(res_annotated$FDR < 0.05)]
    )
  )
biotype_sig$Var1 <- gsub('_',' ',biotype_sig$Var1)
biotype_sig <- biotype_sig[order(biotype_sig$Freq, decreasing = TRUE),]
# png(filename='results/figs/tree.png',width=800, height=1295)
treemap(biotype_sig,
        index = 'Var1',
        vSize = 'Freq',
        type = 'index',
        palette = viridis(length(biotype_sig$Var1)),
        aspRatio = 1/1.618,
        title = '',
        inflate.labels = TRUE,
        lowerbound.cex.labels = 0
)
# dev.off()
biotype <- merge(biotype_all, biotype_sig, by = 'Var1')
colnames(biotype) <- c('biotype','AllGenes','SigGenes')
biotype <- biotype[order(biotype$AllGenes, decreasing = TRUE),]
# Combine rows with low counts into 'other' for the Chi-squared test.
biotype <- rbind(biotype,c('other', sum(biotype$AllGenes[4:8]), sum(biotype$SigGenes[4:8])))
biotype <- biotype[c(1:3,9),]
biotype$AllGenes <- as.integer(biotype$AllGenes)
biotype$SigGenes <- as.integer(biotype$SigGenes)
biotype$AllProp <- round(biotype$AllGenes/sum(biotype$AllGenes), digits = 3)
biotype$SigProp <- round(biotype$SigGenes/sum(biotype$SigGenes), digits = 3)
biotype <- biotype[,c(1,2,4,3,5)]
biotype
# Perform Chi-Squared Goodness-of-Fit Test
# Compute total counts
total_all <- sum(biotype$AllGenes)  # Total genes
total_sig <- sum(biotype$SigGenes)  # Total significant genes
# Compute expected counts under the null hypothesis
biotype$Expected <- total_sig * (biotype$AllGenes / total_all)
chi_sq_test <- chisq.test(biotype$SigGenes, p = biotype$AllGenes / total_all, rescale.p = TRUE)
# Print results
print(chi_sq_test)

#VST for visualization and ranking, generally scales well for large datasets
vsd <- vst(dds, blind = FALSE)
head(assay(vsd), 3)
meanSdPlot(assay(vsd))
#NTD for visualization and ranking 
ntd <- normTransform(dds)
head(assay(ntd), 3)
meanSdPlot(assay(ntd))
#rlog for visualization and ranking, generally works well for small datasets
rld <- rlog(dds, blind = FALSE)
head(assay(rld), 3)
meanSdPlot(assay(rld))
#I'll use rld for visualization moving forward, because of the small sample size.

#Dimensional reduction EDA
#PCA plot
plotPCA(rld, intgroup='condition')
plotPCA(rld, intgroup='condition', pcsToUse=3:4)

#MDS plot
# rld <- calcNormFactors(rld)
sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)
mdsData <- data.frame(cmdscale(sampleDistMatrix))
mds <- cbind(mdsData, as.data.frame(colData(rld)))
ggplot(mds, aes(X1,X2,color=condition,shape=condition)) + geom_point(size=3)

#######################
###Visualize results###
#######################

#Create a function that will output a standalone html file of a plotly figure for integration into the subsequent markdown report.
# setwd('results/figs/')
# outdir <- getwd()
saveInteractivePlot <- function(plot, outdir, filename) {
  # Ensure the output directory exists
  if (!dir.exists(outdir)) {
    dir.create(outdir, recursive = TRUE)
  }
  # Construct full file path
  file_path <- file.path(outdir, filename)
  # Save the widget
  saveWidget(plot, file_path, selfcontained = TRUE)
  # Read the saved HTML file
  html_content <- readLines(file_path)
  # Write the modified HTML file back
  writeLines(html_content, file_path)
  message("Interactive plot saved successfully: ", file_path)
}

#Data Quality assessment by sample clustering and visualization
ann_colors = viridis(3, begin = 0, end = 1, direction = -1)
ann_colors = list(Condition = c(TL=ann_colors[2], TLE=ann_colors[3]))
# df <- as.data.frame(colData(dds)[,'condition'])
df <- data.frame(Condition = colData(dds)[, 'condition'])
rownames(df) <- colnames(dds)  # Ensure row names match colnames in heatmap

select <- order(rowMeans(counts(dds, normalized = TRUE)),
                decreasing = TRUE)[1:10]
#Declare expression matrix
# rld <- rld[,c(which(rld$condition=='TL'),which(rld$condition=='TLE'))]
rld <- rld[,c(which(rld$condition==contrast[2]),which(rld$condition==contrast[3]))]
# Extract sample names in the correct order (TL first, then TLE)
ordered_samples <- colnames(rld)[order(rld$condition)]
pheatmap(assay(rld)[select,ordered_samples],
         cluster_rows = TRUE,
         show_rownames = TRUE,
         cluster_cols = FALSE,
         annotation_col = df,
         annotation_colors = ann_colors,
         color = viridis(n=256, begin = 0, end = 1, direction = -1),
         labels_row = mcols(rld)$geneNew
         )

# select <- assay(rld)[head(order(res$FDR),100),]
#Up-regulated genes
geneNum <- 40
selectUp <- assay(rld)[rownames(res10),ordered_samples][1:geneNum,]
selectUpNames <- rownames(res10)[1:geneNum]
selectUpNames <- which(rownames(dds)%in%selectUpNames)
zUp <- (selectUp - rowMeans(selectUp))/sd(selectUp) #z-score
pheatmap(zUp,
         cluster_rows = TRUE,
         show_rownames = TRUE,
         cluster_cols = FALSE,
         annotation_col = df,
         annotation_colors = ann_colors,
         labels_row = mcols(rld)$geneNew[selectUpNames],
         color = viridis(n=256, begin = 0, end = 1, direction = -1))

#Down-regulated genes
geneNum <- 40
selectDown <- assay(rld)[rownames(res10),ordered_samples][(length(res10[,1])-geneNum):length(res10[,1]),]
selectDownNames <- rownames(res10)[(length(res10[,1])-geneNum):length(res10[,1])]
selectDownNames <- which(rownames(dds)%in%selectDownNames)
zDown <- (selectDown - rowMeans(selectDown))/sd(selectDown) #z-score
pheatmap(zDown,
         cluster_rows = TRUE,
         show_rownames = TRUE,
         cluster_cols = FALSE,
         annotation_col = df,
         annotation_colors = ann_colors,
         labels_row = mcols(rld)$geneNew[selectDownNames],
         color = viridis(n=256, begin = 0, end = 1, direction = -1))

#Up and Down-regulated genes
selectUpDownNames <- c(selectUpNames,selectDownNames)
zUpDown <- rbind(zUp, zDown) #z-score
# setwd('results/figs/')
# tiff(filename= paste0('TLE','_','TL','_upDownHeatmap.tiff'),width=1200, height=1000)
gap_index <- sum(df$Condition == "TL")
pheatmap(zUpDown,
         fontsize = 8,
         cluster_rows = TRUE,
         show_rownames = TRUE,
         cluster_cols = FALSE,
         annotation_col = df,
         annotation_colors = ann_colors,
         # gaps_col = gap_index,
         labels_row = mcols(rld)$geneNew[selectUpDownNames],
         labels_col = "",
         show_colnames = TRUE,
         color = viridis(n=256, begin = 0, end = 1, direction = -1))
# dev.off()

#Volcano plot
EnhancedVolcano(res, lab = rownames(res), x = 'log2FoldChange', y = 'pvalue',
                # title = paste0(contrast[2], ' vs. ', contrast[3]),
                title = bquote('TL vs. TLE'),
                subtitle = NULL,
                legendPosition = 'none',
                pCutoff = 10e-6,
                FCcutoff = 1,
                boxedLabels = TRUE,
                drawConnectors = TRUE,
                colConnectors = 'black',
                widthConnectors = 1.0,
                arrowheads = FALSE,
                col = c('grey30', viridis(3, direction = -1)), colAlpha = 0.7,
                gridlines.major = FALSE,
                gridlines.minor = FALSE) #+ coord_flip()


#Interactive volcano plot
# Set thresholds
fdr_cutoff <- 0.001
logfc_cutoff <- 1
# Set colors
sig_colors <- viridis(3, direction = -1)
# Set tooltip text
res_annotated$tooltip <- paste0(
  res_annotated$gene,
  "<br>log₂FC: ", round(res_annotated$log2FoldChange, 3),
  "<br>FDR: ", signif(res_annotated$FDR, 3)
)

res_annotated$Significant <- ifelse(
  res_annotated$FDR < fdr_cutoff & abs(res_annotated$log2FoldChange) > logfc_cutoff, "Sig",
  ifelse(
    res_annotated$FDR > fdr_cutoff & abs(res_annotated$log2FoldChange) >= logfc_cutoff, "nonDE",
    "nonSig"
  )
)

volcano <- ggplot(res_annotated, aes(x = log2FoldChange, y = -log10(pvalue), text = tooltip)) +
  geom_point(aes(color = Significant), alpha = 0.7) +
  scale_color_manual(values = c("nonSig" = "gray40", "Sig" = sig_colors[3], "nonDE" = sig_colors[1])) +
  labs(
    x = "log₂ Fold Change",
    y = "-log₁₀(p-value)"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = NA),
    plot.background = element_rect(fill = "white", color = NA),
    axis.line = element_line(color = "black"),
    axis.ticks = element_line(color = "black"),
    axis.text = element_text(color = "black"),
    axis.title = element_text(color = "black")
  ) +
  theme(legend.position = "none") +
  # Horizontal line (significance)
  geom_hline(
    yintercept = 5,
    linetype = "dashed",
    color = "black",
    linewidth = 0.3
  ) +
  # Vertical lines (logfc)
  geom_vline(
    xintercept = -logfc_cutoff,
    linetype = "dashed",
    color = "black",
    linewidth = 0.3
  ) +
  geom_vline(
    xintercept = logfc_cutoff,
    linetype = "dashed",
    color = "black",
    linewidth = 0.3
  )

ggplotly(volcano, tooltip = "text")


#Plot counts highest foldChange gene
plotCounts(dds, gene = rownames(res)[which.max(res$log2FoldChange)], intgroup = 'condition', normalized = TRUE, transform = TRUE)

plt <- plotCounts(dds, gene = rownames(res)[which.max(res$log2FoldChange)], intgroup = 'condition', normalized = TRUE, transform = TRUE, returnData=TRUE)

ggplot(plt, aes(x=condition, y=count)) + 
  geom_point(position=position_jitter(w=0.1,h=0)) +
  labs(title = rownames(res)[which.max(res$log2FoldChange)]) +
  # labs(title = rownames(res)[which.min(res$padj)]) +
  theme(plot.title = element_text(hjust = 0.5)) +
  #scale_y_log10(breaks=c(25,100,400))
  scale_y_log10()

#Create annotation function for significance bracket
annotate_significance <- function(p) {
  # Extract gene of interest data
  y_max <- max(plt$count, na.rm = TRUE)
  padj_value <- res[ens, "padj"]
  # Compute positions for bracket and text
  y_bracket <- y_max * 1.30  # 5% above max expression
  y_text <- y_max * 1.5     # 10% above max
  # Add annotation to the plot
  p <- p +
    # Horizontal line (bracket)
    annotate("segment", x = 1, xend = 2, y = y_bracket, yend = y_bracket) +
    # Vertical ticks at each end
    annotate("segment", x = 1, xend = 1, y = y_bracket, yend = y_bracket * 0.90) +
    annotate("segment", x = 2, xend = 2, y = y_bracket, yend = y_bracket * 0.90) +
    # P-value text annotation
    annotate("text", x = 1.5, y = y_text, 
             label = paste("FDR =", format(padj_value, digits=3)))
  p  # Return the modified plot
}

#Plot counts gene of interest
geneOfInt <- 'LGR6'
ens <- which(res$geneNew == geneOfInt)
# plotCounts(dds, gene = rownames(res)[ens], intgroup = 'group')

plt <- plotCounts(dds, gene = rownames(res)[ens], intgroup = 'condition', normalized = TRUE, transform = TRUE, returnData=TRUE)

p <- ggplot(plt, aes(x=condition, y=count, color = condition)) + 
  # geom_boxplot(color = 'black', outlier.shape = NA) +
  geom_point(position=position_jitter(w=0.1,h=0)) +
  scale_color_viridis_d(begin = 0.75, end = 0.25, direction = 1) +
  labs(title = geneOfInt) +
  # labs(title = rownames(res)[which.min(res$padj)]) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = 'none',
        panel.border = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_line('black')) +
  #scale_y_log10(breaks=c(25,100,400))
  scale_y_log10()
p

max(plt$count)
# setwd('results/figs/')
# tiff(filename= paste0(geneOfInt,'_plotCounts.tiff'),width=800, height=1295)
p <- ggplot(plt, aes(x=condition, y=count, color = condition)) + 
  geom_boxplot(color = 'black', outlier.shape = NA, width = 0.25) +
  geom_point(cex = 5, position=position_jitter(w=0.1,h=0)) +
  scale_color_viridis_d(begin = 0.75, end = 0.25, direction = 1) +
  labs(title = bquote(italic(.(geneOfInt))),
       y = 'Counts',
       x = NULL) +
  theme_bw() +
  theme(text = element_text(size=30),
        plot.title = element_text(hjust = 0.5),
        legend.position = 'none',
        panel.border = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_line('black'),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_log10()
p

#Add significance bracket
p + 
  annotate("text",
         x=1.5,
         y=max(plt$count)*1.5,
         label=paste("FDR =",format(res[ens, "padj"],digits=3))
         ) +
  annotate("segment", x = 1, xend = 2, y = max(plt$count) * 1.30, yend = max(plt$count) * 1.30, size = 0.5)

# dev.off()

annotate_significance(p)

p <- ggplot(plt, aes(x=condition, y=count, color = condition)) + 
  geom_boxplot(color = 'black', outlier.shape = NA, width = 0.25) +
  geom_point(cex = 5, position=position_jitter(w=0.1,h=0)) +
  scale_color_viridis_d(begin = 0.75, end = 0.25, direction = 1) +
  labs(title = geneOfInt,
       y = 'Counts',
       x = NULL) +
  theme_bw() +
  theme(text = element_text(size=30),
        plot.title = element_text(hjust = 0.5),
        legend.position = 'none',
        panel.border = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_line('black'),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_log10()
  # annotate("text",
  #          x=1.5,
  #          y=max(plt$count)*1.05,
  #          label=paste("FDR =",format(res[ens, "padj"],digits=3))
  #          )

ggplotly(p)
p_pltly <- ggplotly(annotate_significance(p))
p_pltly

#Save the interactive plot out as a standalone html file
#Uncomment the next line to run the function from source
# source("bin/rnaQuest_html.R")
saveInteractivePlot(p_pltly, 'results/figs/', paste0(geneOfInt, "_plotlyCounts.html"))

#Export an rld expression file for use in the gene expression dashboard app (dash app)

#Get expression data from rld
rld_matrix <- assay(rld)  # genes x samples
rld_long <- as.data.frame(rld_matrix) %>%
  rownames_to_column("gene") %>%
  pivot_longer(-gene, names_to = "sample", values_to = "expression")

#Add sample metadata
sample_meta <- colData(dds) %>%
  as.data.frame() %>%
  rownames_to_column("sample")

rld_joined <- rld_long %>%
  left_join(sample_meta, by = "sample")

rld_joined <- rld_joined %>%
  mutate(gene = toupper(gene))

#Add DE results (log2FC, FDR)
de_results <- as.data.frame(res_annotated) %>%
  select(gene, log2FoldChange, FDR)

#Remove duplicates and keep the row with the lowest FDR
de_results <- de_results %>%
  filter(!is.na(FDR)) %>%
  group_by(gene) %>%
  slice_min(order_by = FDR, n = 1, with_ties = FALSE) %>%
  ungroup()

de_results <- de_results %>%
  mutate(gene = toupper(gene))

#Merge everything
final_df <- rld_joined %>%
  left_join(de_results, by = "gene")

final_df %>%
  filter(!is.na(FDR)) %>%
  select(gene, log2FoldChange, FDR) %>%
  distinct() %>%
  head()
#rows missing log2foldchange of FDR are not significantly DE.

#Save for Dash
write_csv(final_df, "results/dash_expression_data.csv")


#Gene set enrichment analysis
#Coming soon!

Exploratory data analysis

Exploratory Data Analysis (EDA) is the process of visually and statistically examining data to identify patterns, trends, and potential issues before formal analysis. It helps ensure data quality, detect outliers, and guide analytical decisions. If you aren’t doing the analyses yourself, you might normally not even be presented these figures. But they’re important sanity checks to ensure data quality. I’m presenting this here, but these steps were performed before the results presented above. I’m putting this here for the sake of completeness.

All samples total gene counts

Gene counts across all samples are shown both before and after normalization. The first plot illustrates that raw counts are already within a comparable range across samples. Normalization further refines the data, ensuring consistency for downstream analysis, as seen in the second plot.

par(mfrow = c(1, 2))

boxplot(log2(counts(dds, normalized = FALSE) + 1), axes = FALSE, ylab = "log2 counts")
axis(1, labels = FALSE, at = c(seq(1, 40, 1)))
axis(2, labels = TRUE)
title("Non-normalized Gene counts")
text(x = seq_along(colnames(dds)), y = -2, labels = colnames(dds), adj = 0.5, xpd = TRUE, cex = 0.75)

boxplot(log2(counts(dds, normalized = TRUE) + 1), axes = FALSE, ylab = "log2 counts")
axis(1, labels = FALSE, at = c(seq(1, 40, 1)))
axis(2, labels = TRUE)
title("Normalized Gene counts")
text(x = seq_along(colnames(dds)), y = -2, labels = colnames(dds), adj = 0.5, xpd = TRUE, cex = 0.75)

par(mfrow = c(1, 1))

Plot dispersion estimates

A diagnostic plot of how well the data fits the model created by DESeq2 for differential expression testing. Good fit of the data to the model produces a scatter of points (black and blue) around the fitted curve (red). Low expressing genes, with counts below 10 reads across all samples, have been filtered out for computational efficiency.

# Plot dispersion estimates, good fit of the data to the model produces a scatter of points around
# the fitted curve.
plotDispEsts(dds)
title("Dispersion estimates")

Dimensional reduction analyses

In a way a lot of omics analysis in really a game of dimensional reduction. You start off with a huge dataset where you have gene expression data for roughly 20,000 genes, each of them it’s own dimension. But some of the analysis techniques and especially visualizations must be done on a lower number of dimensions, a small handful or maybe even just two.

In comes Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) plots, they serve similar purposes and are interpretted in roughly the same way for our purposes here. PCA and MDS are dimensionality reduction techniques used in RNA-Seq analysis to visualize patterns of similarity or difference between samples. They transform high-dimensional gene expression data into a smaller number of dimensions, in this case two. Thereby making it easier to detect clustering by condition, batch effects, or outliers. These plots provide an intuitive overview of global transcriptional variation across the dataset.

They differ in how they reduce dimensions. Critically, MDS is plotted in a way such that distances between samples are preserved as much as possible in the projection to lower dimensions. This is similar to the difference between T-SNE plots and UMAP plots, where the distance between clusters is more meaningful in UMAPs while it lacks interpretability in T-SNE plots. However, since MDS is based on preserving pair-wise distance you don’t natively get a measure of how much ‘information’ each dimension carries. That’s in contrast to the measure of variance explained by each dimension that you get out of PCA plots. In that way they both have benefits and are a good compliment to each other.

plotPCA(rld, intgroup = "condition") + labs(title = "PCA Plot") + theme_bw() + scale_color_viridis_d(begin = 0.75,
    end = 0.25, direction = 1) + theme(text = element_text(size = 10), plot.title = element_text(hjust = 0.5),
    legend.position = "right", panel.border = element_blank(), panel.grid = element_blank(), axis.line = element_line("black"))
## using ntop=500 top features by variance
sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)
mdsData <- data.frame(cmdscale(sampleDistMatrix))
mds <- cbind(mdsData, as.data.frame(colData(rld)))
ggplot(mds, aes(X1, X2, color = condition)) + geom_point(size = 3) + labs(title = "MDS Plot", x = "Dim 1",
    y = "Dim 2") + theme_bw() + scale_color_viridis_d(begin = 0.75, end = 0.25, direction = 1) + theme(text = element_text(size = 10),
    plot.title = element_text(hjust = 0.5), legend.position = "none", panel.border = element_blank(),
    panel.grid = element_blank(), axis.line = element_line("black"))

Up-regulated vs. Down-regulated genes

We can see here that there are roughly the same number of up-regulated as down regulated genes as was noted in the results summary and could be seen in the volcano plot. 758 are up-regulated, 818 are down-regulated.

bar <- data.frame(direction = c("Down", "Up"), genes = c(length(which(res05$log2FoldChange < 0)), length(which(res05$log2FoldChange >
    0))))

# Create barplot
ggplot(bar, aes(x = direction, y = genes, fill = direction)) + geom_col(position = "stack") + scale_fill_viridis(2,
    begin = 0.3, end = 0.7, direction = 1, discrete = TRUE) + labs(x = NULL, y = "Number of genes", fill = "Direction") +
    theme(legend.title = element_blank(), panel.grid.major = element_blank(), panel.background = element_blank(),
        legend.position = "none", axis.line = element_line(color = "black"), axis.ticks = element_line(color = "black"))

Gene Biotype treemap for all genes

The treemap below shows the relative proportion of each gene biotype for all genes in the dataset. The area of the rectangle is proportional to the percentage of each biotype. The area of each of the rectangles is proportional to the percentage of each biotype when comparing Epilepsy and Control (ref.). The majority are protein coding which is normal, but protein coding genes make up a significantly higher proportion of differentially expressed genes (right figure).

# Create treemap of gene biotypes for all genes.
biotype_all <- data.frame(table(mcols(dds)$GENETYPE))
biotype_all$Var1 <- gsub("_", " ", biotype_all$Var1)
biotype_all <- biotype_all[order(biotype_all$Freq, decreasing = TRUE), ]
# remove rows with 0 counts biotype_all <- biotype_all[biotype_all$Freq > 0, ]

# Create treemap of gene biotypes for differentially expressed genes.
resTable$GENETYPE <- as.factor(resTable$GENETYPE)
# levels(resTable$GENETYPE)
biotype_sig <- data.frame(table(resTable$GENETYPE[which(resTable$FDR < 0.05)]))
biotype_sig$Var1 <- gsub("_", " ", biotype_sig$Var1)
biotype_sig <- biotype_sig[order(biotype_sig$Freq, decreasing = TRUE), ]
# remove rows with 0 counts biotype_sig <- biotype_sig[biotype_sig$Freq > 0, ]

# Set up a custom layout layout(matrix(c(1, 2), nrow = 1, ncol = 2))

treemap(biotype_all, index = "Var1", vSize = "Freq", type = "index", palette = viridis(length(biotype_all$Var1)),
    aspRatio = 1.618/1, title = "All Genes", inflate.labels = TRUE, lowerbound.cex.labels = 0)

treemap(biotype_sig, index = "Var1", vSize = "Freq", type = "index", palette = viridis(length(biotype_sig$Var1)),
    aspRatio = 1.618/1, title = "DE Genes", inflate.labels = TRUE, lowerbound.cex.labels = 0)

biotype <- merge(biotype_all, biotype_sig, by = "Var1")
colnames(biotype) <- c("biotype", "AllGenes", "SigGenes")
biotype <- biotype[order(biotype$AllGenes, decreasing = TRUE), ]
# Combine rows with low counts into 'other' for the Chi-squared test.
biotype <- rbind(biotype, c("other", sum(biotype$AllGenes[4:8]), sum(biotype$SigGenes[4:8])))
biotype <- biotype[c(1:3, 9), ]
biotype$AllGenes <- as.integer(biotype$AllGenes)
biotype$SigGenes <- as.integer(biotype$SigGenes)
biotype$AllProp <- round(biotype$AllGenes/sum(biotype$AllGenes), digits = 3)
biotype$SigProp <- round(biotype$SigGenes/sum(biotype$SigGenes), digits = 3)
biotype <- biotype[, c(1, 2, 4, 3, 5)]
# biotype Perform Chi-Squared Goodness-of-Fit Test Compute total counts
total_all <- sum(biotype$AllGenes)  # Total genes
total_sig <- sum(biotype$SigGenes)  # Total significant genes
datatable(as.data.frame(biotype), options = list(scrollX = TRUE, scrollCollapse = TRUE), rownames = FALSE,
    caption = paste0("These are the percent of each gene biotype when comparing ", status[2], " and ",
        status[1], " (ref.). The majority are protein coding, a proportion that increases sees a statistically significant increase in genes that are significantly differentially expressed."))
# Compute expected counts under the null hypothesis
biotype$Expected <- total_sig * (biotype$AllGenes/total_all)
chi_sq_test <- chisq.test(biotype$SigGenes, p = biotype$AllGenes/total_all, rescale.p = TRUE)
# Print results print(chi_sq_test)

There is a statistically significant difference in the proportion of gene types when comparing the full dataset to the subset of differentially expressed genes (X2 = 117.371, df = 3, p-value = 2.842e-25).

Session info

Session Info provides details about the computational environment used for this analysis, including the versions of R, loaded packages, and system settings. This ensures that the analysis can be reproduced accurately in the future, even if software updates change how certain functions behave.

sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Phoenix
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] tibble_3.2.1                getable_1.0.3              
##  [3] reactable_0.4.4             DT_0.33                    
##  [5] kableExtra_1.4.0            EnhancedVolcano_1.22.0     
##  [7] ggrepel_0.9.6               pheatmap_1.0.12            
##  [9] plotly_4.10.4               vsn_3.72.0                 
## [11] gridExtra_2.3               treemap_2.4-4              
## [13] DESeq2_1.44.0               SummarizedExperiment_1.34.0
## [15] Biobase_2.64.0              MatrixGenerics_1.16.0      
## [17] matrixStats_1.5.0           GenomicRanges_1.56.2       
## [19] GenomeInfoDb_1.40.1         IRanges_2.38.1             
## [21] S4Vectors_0.42.1            BiocGenerics_0.50.0        
## [23] viridis_0.6.5               viridisLite_0.4.2          
## [25] Glimma_2.14.0               limma_3.60.6               
## [27] ggplot2_3.5.1              
## 
## loaded via a namespace (and not attached):
##  [1] formatR_1.14            rlang_1.1.4             magrittr_2.0.3         
##  [4] gridBase_0.4-7          compiler_4.4.1          systemfonts_1.2.1      
##  [7] vctrs_0.6.5             stringr_1.5.1           pkgconfig_2.0.3        
## [10] crayon_1.5.3            fastmap_1.2.0           XVector_0.44.0         
## [13] labeling_0.4.3          promises_1.3.2          rmarkdown_2.29         
## [16] UCSC.utils_1.0.0        preprocessCore_1.66.0   purrr_1.0.4            
## [19] xfun_0.51               zlibbioc_1.50.0         cachem_1.1.0           
## [22] jsonlite_1.8.8          later_1.4.1             DelayedArray_0.30.1    
## [25] BiocParallel_1.38.0     irlba_2.3.5.1           parallel_4.4.1         
## [28] R6_2.6.1                bslib_0.9.0             stringi_1.8.4          
## [31] RColorBrewer_1.1-3      SQUAREM_2021.1          jquerylib_0.1.4        
## [34] Rcpp_1.0.14             knitr_1.49              httpuv_1.6.15          
## [37] Matrix_1.7-0            igraph_2.1.4            tidyselect_1.2.1       
## [40] rstudioapi_0.17.1       abind_1.4-8             yaml_2.3.10            
## [43] codetools_0.2-20        affy_1.82.0             lattice_0.22-6         
## [46] shiny_1.10.0            withr_3.0.2             evaluate_1.0.3         
## [49] xml2_1.3.6              pillar_1.10.1           affyio_1.74.0          
## [52] BiocManager_1.30.25     generics_0.1.3          invgamma_1.1           
## [55] truncnorm_1.0-9         munsell_0.5.1           scales_1.3.0           
## [58] ashr_2.2-63             xtable_1.8-4            glue_1.7.0             
## [61] lazyeval_0.2.2          tools_4.4.1             data.table_1.17.0      
## [64] locfit_1.5-9.11         tidyr_1.3.1             crosstalk_1.2.1        
## [67] edgeR_4.2.2             colorspace_2.1-1        GenomeInfoDbData_1.2.12
## [70] cli_3.6.3               mixsqp_0.3-54           S4Arrays_1.4.1         
## [73] svglite_2.1.3           dplyr_1.1.4             gtable_0.3.6           
## [76] reactR_0.6.1            sass_0.4.9              digest_0.6.36          
## [79] SparseArray_1.4.8       farver_2.1.2            htmlwidgets_1.6.4      
## [82] htmltools_0.5.8.1       lifecycle_1.0.4         httr_1.4.7             
## [85] statmod_1.5.0           mime_0.12